Abstract

Many methods have been proposed for community detection in networks, but most of them do not 
take into account additional information on the nodes that is often available in practice. In this paper, 
we propose a new joint community detection criterion that uses both the network edge information 
and the node features to detect community structures. One advantage our method has over existing 
joint detection approaches is the flexibility of learning the impact of different features which may differ 
across communities. Another advantage is the flexibility of choosing the amount of influence the feature 
information has on communities. The method is asymptotically consistent under the block model with 
additional assumptions on the feature distributions, and performs well on simulated and real networks. 

Community detection is a fundamental problem in network analysis, extensively studied in a number of
domains. A number of approaches to community detection are based on probabilistic models for networks
with communities, such as the stochastic block model, the degree-corrected stochastic block model, and
the latent factor model. Other approaches work by optimizing a criterion measuring the strength of
community structure in some sense, often through spectral approximations. Examples include normalized
cuts, modularity, and many variants of spectral clustering.

Many of the existing methods detect communities based only on the network adjacency matrix. However,
we often have additional information on the nodes (node features), and sometimes on the edges as well.
In many networks the distribution of node features is correlated with community structure, and thus a
natural question is whether we can improve community detection by using the node features. Several
generative models for jointly modeling the edges and the features have been proposed, including the
network random effects model, the embedding feature model, the latent variable model, the discriminative
approach, the latent multi-group membership graph model, the social circles model for ego networks, the
communities from edge structure and node attributes (CESNA) model, the Bayesian Graph Clustering (BAGC)
model, and the topical communities and personal interest (TCPI) model. Most of these models are designed
for specific feature types, and their effectiveness depends heavily on the correctness of model
specification. Model-free approaches include weighted combinations of the network and feature
similarities, attribute-structure mining, simulated annealing clustering, and compressive information
flow. Most methods in this category use all the features in the same way without determining which ones
influence the community structure and which do not, and lack flexibility in how to balance the network
information with the information coming from its node features, which do not always agree. Including
irrelevant node features can only hurt community detection by adding noise, while features that cluster
strongly by themselves are not necessarily the ones that correlate with the community structure present
in the adjacency matrix.

In this paper, we propose a new joint community detection criterion that uses both the network
adjacency matrix and the node features. The idea is that by properly weighting edges according to feature
similarities on their end nodes, we strengthen the community structure in the network, thus making it
easier to detect. Rather than using all available features in the same way, we learn from data which
features are most helpful in identifying the community structure. Intuitively, our method looks for an
agreement between clusters suggested by two data sources, the adjacency matrix and the node features.
Numerical experiments on simulated and real networks show that our method performs well compared to
methods that use either the network alone or the features alone for clustering, as well as to a number
of benchmark joint detection methods.


1 The joint community detection criterion 


Our method is designed to look for assortative community structure, that is, the type of communities where 
nodes are more likely to connect to each other if they belong to the same community, and thus there are 
more edges within communities than between. This is a very common intuitive definition of communities 
which is incorporated in many community detection criteria, for example, modularity. Our goal is to
use such a community detection criterion based on the adjacency matrix alone, and add feature-based edge 
weights to improve detection. Several criteria using the adjacency matrix alone are available, but having 
a simple criterion linear in the adjacency matrix makes optimization much more feasible in our particular 
situation, and we propose a new criterion which turns out to work particularly well for our purposes. Let $A$
denote the adjacency matrix with $A_{ij} = 0$ if there is no edge between nodes $i$ and $j$, and otherwise
$A_{ij} > 0$, which can be either 1 for unweighted networks or the edge weight for weighted networks. The
community detection criterion we start from is a very simple analogue of modularity, to be maximized over
all possible label assignments $e$:

$$R(e) = \sum_{k=1}^{K} \frac{1}{|\mathcal{E}_k|^{\alpha}} \sum_{i,j \in \mathcal{E}_k} A_{ij}. \qquad (1.1)$$

Here $e$ is the vector of node labels, with $e_i = k$ if node $i$ belongs to community $k$, for $k = 1, \dots, K$,
$\mathcal{E}_k = \{i : e_i = k\}$, and $|\mathcal{E}_k|$ is the number of nodes in community $k$. We assume each node
belongs to exactly one community, and the number of communities $K$ is fixed and known. Rescaling by
$|\mathcal{E}_k|^{\alpha}$ is designed to rule out trivial solutions that put all nodes in the same community, and
$\alpha > 0$ is a tuning parameter. When $\alpha = 2$, the criterion is approximately the sum of edge densities
within communities, and when $\alpha = 1$, the criterion is the sum of average “within community” degrees,
which both intuitively represent community structure. This criterion can be shown to be consistent under
the stochastic block model by checking the conditions of a general consistency theorem for community
detection criteria.
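As a small illustration of criterion (1.1), the sketch below evaluates it on a toy graph. The function name `community_criterion`, the list-of-lists adjacency format, and the toy graph are assumptions made for this example only; they are not part of the original method description.

```python
def community_criterion(adj, labels, alpha=1.0):
    """Evaluate criterion (1.1) for a symmetric adjacency matrix and a label vector."""
    n = len(labels)
    sizes = {}   # number of nodes per community
    within = {}  # sum of A_ij over ordered pairs (i, j) inside each community
    for i in range(n):
        sizes[labels[i]] = sizes.get(labels[i], 0) + 1
        for j in range(n):
            if labels[i] == labels[j]:
                # Ordered pairs: every within-community edge is counted twice.
                within[labels[i]] = within.get(labels[i], 0.0) + adj[i][j]
    return sum(within[k] / sizes[k] ** alpha for k in sizes)


# Toy graph (hypothetical): two triangles joined by the single edge (2, 3).
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
two_triangles = [0, 0, 0, 1, 1, 1]
one_block = [0, 0, 0, 0, 0, 0]

print(community_criterion(adj, two_triangles, alpha=1.0))
print(community_criterion(adj, one_block, alpha=1.0))
```

With $\alpha = 1$ the two-triangle split scores $6/3 + 6/3 = 4$, while putting all nodes in one community scores $14/6$, so the rescaling does penalize the trivial solution on this example.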

The ideal use of features with this criterion would be to use them to up-weight edges within communities
and down-weight edges between them, thus enhancing the community structure in the observed network
and making it easier to detect. However, node features may not be perfectly correlated with community
structure, different communities may be driven by different features, and features themselves may be
noisy. Thus we need to learn the impact of different features on communities as well as balance the roles
of the network itself and its features. Let $f_i$ denote the $p$-dimensional feature vector of node $i$.
We propose a joint community detection criterion (JCDC),

$$R(e, \beta; w_n) = \sum_{k=1}^{K} \frac{1}{|\mathcal{E}_k|^{\alpha}} \sum_{i,j \in \mathcal{E}_k} A_{ij} W(f_i, f_j, \beta_k; w_n), \qquad (1.2)$$

where $\alpha$ is a tuning parameter as in (1.1), $\beta_k \in \mathbb{R}^p$ is the coefficient vector that defines
the impact of different features on the $k$th community, and $\beta := \{\beta_1, \dots, \beta_K\}$. The criterion
is then maximized over both $e$ and $\beta$. Having a different $\beta_k$ for each $k$ allows us to learn the roles
different features may play in different communities. The balance between the information from $A$ and
$F := \{f_1, \dots, f_n\}$ is controlled by $w_n$, another tuning parameter which in general may depend on $n$.

For the sake of simplicity, we model the edge weight $W(f_i, f_j, \beta_k; w_n)$ as a function of the node
features $f_i$ and $f_j$ via a $p$-dimensional vector of their similarity measures $\phi_{ij} = \phi(f_i, f_j)$.
The choice of similarity measures in $\phi$ depends on the type of $f_i$ (for example, on whether the features
are numerical or categorical) and is determined on a case by case basis; the only important property is
that $\phi$ assigns higher values to features that are more similar. Note that this trivially allows the
inclusion of edge features as well as node features, as long as they are converted to some sort of
similarity. To eliminate potential differences in units and scales, we standardize all $\phi_{ij}$ along each
feature dimension. Finally, the function $W$ should be increasing in $\langle \phi_{ij}, \beta_k \rangle$, which can
be viewed as the “overall similarity” between nodes, and for optimization purposes it is convenient to
take $W$ to be concave. Here we use the exponential function,

$$w_{ijk} = W(f_i, f_j, \beta_k; w_n) = w_n - e^{-\langle \phi_{ij}, \beta_k \rangle}. \qquad (1.3)$$

One can use other functions of similar shapes, for example, the logit exponential function, which we found
empirically to perform similarly.
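The following sketch shows one way to turn node features into edge weights of the exponential form $w_n - \exp(-\langle \phi_{ij}, \beta_k \rangle)$. The similarity used here (negative absolute difference of a numerical feature), the helper names, and the toy values are assumptions for illustration; the text leaves the choice of similarity to the user.

```python
import math


def standardize_columns(phi):
    """Center and scale each similarity dimension across all node pairs."""
    m, p = len(phi), len(phi[0])
    out = [[0.0] * p for _ in range(m)]
    for d in range(p):
        col = [row[d] for row in phi]
        mean = sum(col) / m
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / m)
        for r in range(m):
            out[r][d] = (col[r] - mean) / sd if sd > 0 else 0.0
    return out


def edge_weight(phi_ij, beta_k, w_n):
    """Exponential edge weight: w_n - exp(-<phi_ij, beta_k>)."""
    inner = sum(a * b for a, b in zip(phi_ij, beta_k))
    return w_n - math.exp(-inner)


# Hypothetical numerical feature for three nodes; nodes 0 and 1 are similar.
feature = [0.1, 0.2, 3.0]
pairs = [(0, 1), (0, 2), (1, 2)]
# Assumed similarity: negative absolute difference (higher means more similar).
phi = standardize_columns([[-abs(feature[i] - feature[j])] for i, j in pairs])
weights = [edge_weight(row, [1.0], w_n=5.0) for row in phi]
print(dict(zip(pairs, weights)))
```

With $\beta_k = 0$ every weight equals $w_n - 1$, and for any $\beta_k$ the weight stays below $w_n$, which matches the bounds discussed later for the role of $w_n$.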


2 Estimation 

The joint community detection criterion needs to be optimized over both the community assignments $e$ and
the feature parameters $\beta$. Using block coordinate descent, we optimize JCDC by alternately optimizing over
the labels with fixed parameters and over the parameters with fixed labels, and iterating until convergence.


2.1 Optimizing over label assignments with fixed weights 

When the parameters $\beta$ are fixed, all edge weights $w_{ijk}$ can be treated as known constants. It is
infeasible to search over all $K^n$ possible label assignments, and, like many other community detection
methods, we rely on a greedy label switching algorithm to optimize over $e$, specifically, the tabu search,
which updates the label of one node at a time. Since our criterion involves the number of nodes in each
community $|\mathcal{E}_k|$, no easy spectral approximations are available. Fortunately, our method allows for a
simple local approximate update which does not require recalculating the entire criterion. For a given
node $i$ considered for label switching, the algorithm will assign it to community $k$ rather than $l$ if

$$\frac{S_{kk} + 2 S_{i \leftrightarrow k}}{(|\mathcal{E}_k| + 1)^{\alpha}} + \frac{S_{ll}}{|\mathcal{E}_l|^{\alpha}} > \frac{S_{kk}}{|\mathcal{E}_k|^{\alpha}} + \frac{S_{ll} + 2 S_{i \leftrightarrow l}}{(|\mathcal{E}_l| + 1)^{\alpha}}, \qquad (2.1)$$

where $S_{kk}$ is twice the total edge weights in community $k$, and $S_{i \leftrightarrow k}$ is the sum of edge
weights between node $i$ and all the nodes in $\mathcal{E}_k$. When $|\mathcal{E}_k|$ and $|\mathcal{E}_l|$ are large, we can
ignore $+1$ in the denominators, and (2.1) becomes

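$$\frac{S_{i \leftrightarrow k}}{|\mathcal{E}_k|} \left( \frac{|\mathcal{E}_k|}{|\mathcal{E}_l|} \right)^{1-\alpha} > \frac{S_{i \leftrightarrow l}}{|\mathcal{E}_l|}, \qquad (2.2)$$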

which allows for a “local” update for the label of node $i$ without calculating the entire criterion. This
also highlights the impact of the tuning parameter $\alpha$: when $\alpha = 1$, the two sides of (2.2) can be
viewed as averaged weights of all edges connecting node $i$ to communities $\mathcal{E}_k$ and $\mathcal{E}_l$,
respectively. Then our method assigns node $i$ to the community with which it has the strongest connection.
When $\alpha \neq 1$, the left hand side of (2.2) is multiplied by a factor
$(|\mathcal{E}_k|/|\mathcal{E}_l|)^{1-\alpha}$. Suppose $|\mathcal{E}_k|$ is larger than $|\mathcal{E}_l|$; then choosing
$0 < \alpha < 1$ indicates a preference for assigning a node to the larger community, while $\alpha > 1$ favors
smaller communities. A detailed numerical investigation of the role of $\alpha$ is provided in the
Supplemental Material.
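The local rule described above can be sketched in a few lines. The function below compares the average connection weight of node $i$ to community $k$, rescaled by the size-ratio factor with exponent $1 - \alpha$, against its average connection weight to community $l$. The function name and the numbers in the example are hypothetical.

```python
def prefers_k_over_l(s_ik, s_il, size_k, size_l, alpha=1.0):
    """Approximate local rule: True if node i should go to community k rather than l.

    s_ik, s_il: sums of edge weights between node i and communities k and l.
    size_k, size_l: numbers of nodes in communities k and l.
    """
    lhs = (s_ik / size_k) * (size_k / size_l) ** (1.0 - alpha)
    rhs = s_il / size_l
    return lhs > rhs


# Hypothetical node: average weight 0.06 to a community of 100 nodes (k)
# and 0.08 to a community of 50 nodes (l).
for alpha in (0.5, 1.0, 2.0):
    print(alpha, prefers_k_over_l(6.0, 4.0, 100, 50, alpha))
```

In this example the node joins the smaller community $l$ for $\alpha = 1$ and $\alpha = 2$, but switches to the larger community $k$ for $\alpha = 0.5$, in line with the stated preference of small $\alpha$ for larger communities.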


The edge weights involved in (2.2) depend on the tuning parameter $w_n$. When $\beta = 0$, all weights are
equal to $w_n - 1$. On the other hand, $w_{ijk} < w_n$ for all values of $\beta$. Therefore, $w_n/(w_n - 1)$ is the
maximum amount by which our method can reweight an edge. When $w_n$ is large, $w_n/(w_n - 1) \approx 1$, and thus
the information from the network structure dominates. When $w_n$ is close to 1, the ratio is large and the
feature-driven edge weights have a large impact. See the Supplemental Material for more details on the
choice of $w_n$.
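To make the role of the ratio concrete, the short sketch below evaluates $w_n/(w_n - 1)$ for a few values, including the two used in the numerical experiments (1.5 and 5). The function name is an assumption for this example.

```python
def max_reweighting_ratio(w_n):
    """Upper bound w_n / (w_n - 1) on how much features can reweight an edge (requires w_n > 1)."""
    return w_n / (w_n - 1.0)


for w_n in (1.1, 1.5, 5.0, 10.0):
    print(w_n, max_reweighting_ratio(w_n))
```

For $w_n = 1.5$ the bound is 3, so features can change an edge weight by up to a factor of three, whereas for $w_n = 5$ the bound is 1.25 and the network structure dominates.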

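While the tuning parameter $w_n$ controls the amount of influence features can have on community
detection, it does not affect the estimated parameters $\beta$ for a fixed community assignment. This is easy
to see from rearranging terms in (1.2):

$$R(e, \beta; w_n) = w_n \sum_{k=1}^{K} \frac{1}{|\mathcal{E}_k|^{\alpha}} \sum_{i,j \in \mathcal{E}_k} A_{ij} + g(e, \beta), \qquad g(e, \beta) = -\sum_{k=1}^{K} \frac{1}{|\mathcal{E}_k|^{\alpha}} \sum_{i,j \in \mathcal{E}_k} A_{ij} e^{-\langle \phi_{ij}, \beta_k \rangle},$$

where the function $g$ does not depend on $w_n$. Note that the term containing $w_n$ does not depend on $\beta$.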
2.2 Optimizing over weights with fixed label assignments


Since we chose a concave edge weight function (1.3), for a given community assignment $e$ the joint
criterion is a concave function of $\beta_k$, and it is straightforward to optimize over $\beta_k$ by gradient
ascent. The role of $\beta_k$ is to control the impact of different features on each community. One can show by
a Taylor-series type expansion around the maximum (details omitted) and also observe empirically that for
our method, the estimated $\beta_k$'s are correlated with the feature similarities between nodes in community
$k$. In other words, our method tends to produce a large estimated $\beta_k^{(\ell)}$ for a feature $\ell$ with high
similarity values $\phi_{ij}^{(\ell)}$ for $i, j \in \mathcal{E}_k$. However, in the extreme case, the optimal
$\beta_k^{(\ell)}$ can be $+\infty$ if all $\phi_{ij}^{(\ell)}$'s are positive in community $k$ or $-\infty$ if all
$\phi_{ij}^{(\ell)}$'s are negative (recall that similarities are standardized, so this cannot happen in all
communities). To avoid these extreme solutions, we subtract a penalty term $\lambda \|\beta\|_1$ from the
criterion (1.2) while optimizing over $\beta$. We use a very small value of $\lambda$ ($\lambda = 10^{-5}$ everywhere
in the paper) which safeguards against numerically unstable solutions but has very little effect on other
estimated coefficients.
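A minimal sketch of the coefficient update for a single community with fixed labels is given below, assuming the exponential weight $w_n - \exp(-\langle \phi_{ij}, \beta_k \rangle)$. The text only states that gradient ascent with a small $\ell_1$ penalty is used; the proximal (soft-thresholding) step, the step size, the iteration count, and all function names here are choices made for this illustration, not the authors' implementation.

```python
import math


def soft_threshold(x, t):
    """Proximal operator of t * |x|."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def penalized_objective(beta, phis, a_vals, size_k, w_n, alpha=1.0, lam=1e-5):
    """Community-k summand of the criterion minus lam * ||beta||_1."""
    total = sum(a * (w_n - math.exp(-sum(x * b for x, b in zip(phi, beta))))
                for phi, a in zip(phis, a_vals))
    return total / size_k ** alpha - lam * sum(abs(b) for b in beta)


def fit_beta_k(phis, a_vals, size_k, alpha=1.0, lam=1e-5, step=0.1, n_iter=500):
    """Proximal gradient ascent over beta_k for fixed labels (assumed scheme)."""
    p = len(phis[0])
    beta = [0.0] * p
    scale = size_k ** alpha
    for _ in range(n_iter):
        grad = [0.0] * p
        for phi, a in zip(phis, a_vals):
            e = a * math.exp(-sum(x * b for x, b in zip(phi, beta)))
            for d in range(p):
                # d/d beta of -exp(-<phi, beta>) is phi * exp(-<phi, beta>)
                grad[d] += phi[d] * e / scale
        beta = [soft_threshold(b + step * g, step * lam) for b, g in zip(beta, grad)]
    return beta


# Hypothetical similarities for the edges inside one community (two features).
phis = [(1.0, 0.5), (1.0, -0.5), (-0.5, 0.0), (1.0, 0.0)]
a_vals = [1.0, 1.0, 1.0, 1.0]
beta_hat = fit_beta_k(phis, a_vals, size_k=4)
print(beta_hat)
```

In this toy example the first feature has mostly positive similarities within the community and receives a positive coefficient (close to $\ln 6 / 1.5$), while the second feature, whose similarities are symmetric around zero, keeps a zero coefficient.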


3 Consistency 


The proposed JCDC criterion (1.2) is not model-based, but under certain models it is asymptotically
consistent. We consider the setting where the network $A$ and the features $F$ are generated independently
from a stochastic block model and a uniformly bounded distribution, respectively. Let
$P(A_{ij} = 1) = \rho_n P_{c_i c_j}$, where $\rho_n$ is a factor controlling the overall edge density and
$c = (c_1, \dots, c_n)$ is the vector of true labels. Assume the following regularity conditions hold:


1. There exist global constants $M_\phi$ and $M_\beta$, such that $\|\phi_{ij}\|_2 \le M_\phi$ and
$\|\beta_k\|_2 \le M_\beta$ for all $k$, and the tuning parameter $w_n$ satisfies $\log w_n \ge M_\phi M_\beta$.

2. Let $\mathcal{C}_k := \{i : c_i = k\}$. There exists a global constant $\pi_0$ such that
$|\mathcal{C}_k| \ge \pi_0 n > 0$ for all $k$.

3. For all $1 \le k < l \le K$, $2(K - 1) P_{kl} < \min(P_{kk}, P_{ll})$.


Condition [1] states that node feature similarities are uniformly bounded. This is a mild condition in
many applications as the node features are often themselves uniformly bounded. In practice, for numerical
stability the user may want to standardize node features and discard individual features with very low
variance, before calculating the corresponding similarities $\phi$. Condition [2] guarantees communities do
not vanish asymptotically. Condition [3] enforces assortativity. Since the estimated labels $\hat{e}$ are only
defined up to an arbitrary permutation of communities, we measure the agreement between $\hat{e}$ and $c$ by
$d(\hat{e}, c) = \min_{\sigma \in \mathcal{P}_K} \frac{1}{n} \sum_{i=1}^{n} 1(\sigma(\hat{e}_i) \neq c_i)$, where
$\mathcal{P}_K$ is the set of all permutations of $\{1, \dots, K\}$.
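The permutation-invariant error described above can be computed by brute force for small $K$. The sketch below assumes labels coded as $0, \dots, K-1$ and reports the proportion of mismatched nodes under the best relabeling; the function name and the zero-based coding are assumptions for this example, and enumerating all $K!$ permutations is only practical for small $K$.

```python
from itertools import permutations


def misclassification_rate(e, c, K):
    """Smallest fraction of mismatches between e and c over all relabelings of e."""
    n = len(e)
    best = n
    for perm in permutations(range(K)):
        errors = sum(1 for ei, ci in zip(e, c) if perm[ei] != ci)
        best = min(best, errors)
    return best / n


print(misclassification_rate([1, 1, 0, 0], [0, 0, 1, 1], K=2))
print(misclassification_rate([0, 0, 0, 1], [0, 0, 1, 1], K=2))
```

The first call compares two identical partitions with swapped label names and gives zero error; the second has one node out of four in the wrong community.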


Theorem 1 (Consistency of JCDC). Under conditions [1] and [3], if $n \rho_n \to \infty$, $w_n \rho_n \to \infty$, and the
parameter $\alpha$ satisfies

$\max_{k<l} 2(K - 1) P_{kl}$

then we have, for any fixed $\delta > 0$,

$$P\left( d(\hat{e}, c) > \delta \right) \to 0.$$

The proof is given in the Supplemental Material.


4 Simulation studies 

We compare JCDC to three representative benchmark methods which use both the adjacency matrix and
the node features: CASC (Covariate Assisted Spectral Clustering), CESNA (Communities from Edge
Structure and Node Attributes), and BAGC (BAyesian Graph Clustering). In addition, we also
include two standard methods that use either the network adjacency alone (SC, spectral clustering on the
Laplacian regularized with a small constant $\tau = 10^{-7}$) or the node features alone (KM, $K$-means
performed on the $p$-dimensional node feature vectors, with 10 random initial starting values). We
generate networks with $n = 150$ nodes and $K = 2$ communities of sizes 100 and 50 from the degree-corrected
stochastic block model as follows. The edges are generated independently with probability
$\theta_i \theta_j p$ if nodes $i$ and $j$ are in the same community, and $r \theta_i \theta_j p$ if nodes $i$ and $j$
are in different communities. We set $p = 0.1$ and vary $r$ from 0.25 to 0.75. We set 5% of the nodes in each
community to be “hub” nodes with the degree correction parameter $\theta_i = 10$, and for the remaining
nodes set $\theta_i = 1$. All resulting products are thresholded at 0.99 to ensure there are no probability
values over 1. These settings result in the average expected node degree ranging approximately from 22
to 29.
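The network-generating step described above can be sketched as follows. Function and parameter names are assumptions for this example, as is the choice to round the number of hubs down (5% of 50 nodes is 2.5, and the text does not say how this is resolved) and to place hubs first within each community.

```python
import random


def simulate_dcsbm(sizes=(100, 50), p=0.1, r=0.5, hub_frac=0.05,
                   hub_theta=10.0, cap=0.99, seed=0):
    """Sample an undirected degree-corrected block model network with hub nodes."""
    rng = random.Random(seed)
    labels, theta = [], []
    for k, size in enumerate(sizes):
        n_hubs = int(hub_frac * size)  # assumption: round the hub count down
        for idx in range(size):
            labels.append(k)
            theta.append(hub_theta if idx < n_hubs else 1.0)
    n = len(labels)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            base = p if labels[i] == labels[j] else r * p
            prob = min(theta[i] * theta[j] * base, cap)  # threshold at 0.99
            if rng.random() < prob:
                adj[i][j] = adj[j][i] = 1
    return adj, labels, theta


adj, labels, theta = simulate_dcsbm(r=0.5, seed=1)
avg_degree = sum(sum(row) for row in adj) / len(adj)
print(len(adj), avg_degree)
```

The out-in ratio `r` is the quantity varied from 0.25 to 0.75 in the experiments; larger values make the communities harder to separate from the edges alone.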




[Figure 1 panels: (a) JCDC, $w_n = 5$; (b) JCDC, $w_n = 1.5$; (c) SC; (d) KM; (e) CASC; (f) CESNA; (g) BAGC]

Figure 1: Performance of different methods measured by normalized mutual information as a function of $r$
(out-in probability ratio) and $\mu$ (feature signal strength).


For each node $i$, we generate $p = 2$ features, with one “signal” feature related to the community structure
and one “noise” feature whose distribution is the same for all nodes. The “signal” feature follows the
distribution $N(\mu, 1)$ for nodes in community 1 and $N(-\mu, 1)$ for nodes in community 2, with $\mu$ varying
from 0.5 to 2 (larger $\mu$ corresponds to stronger signal). For use with CESNA, which only allows categorical
node features, we discretize the continuous node features by partitioning the real line into 20 bins using
the 0.05, 0.1, ..., 0.95-th quantiles. For JCDC, based on the study of the tuning parameters in the
Supplemental Material, we use $\alpha = 1$ and compare two values of $w_n$, $w_n = 1.5$ and $w_n = 5$. Finally,
agreement between the estimated communities and the true community labels is measured by normalized
mutual information, a measure commonly used in the network literature which ranges between 0 (random
guessing) and 1 (perfect agreement). For each configuration, we repeat the experiments 30 times, and record
the average NMI over the 30 replications.
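For readers who want to reproduce the agreement measure, the sketch below computes normalized mutual information between two label vectors. Several normalizations of mutual information are in use and the text does not say which one was applied; this example assumes the arithmetic-mean normalization $2I(X;Y)/(H(X)+H(Y))$, and the function name is hypothetical.

```python
import math
from collections import Counter


def nmi(a, b):
    """Normalized mutual information 2*I(a;b) / (H(a) + H(b)) between two labelings."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    h_a = -sum(v / n * math.log(v / n) for v in ca.values())
    h_b = -sum(v / n * math.log(v / n) for v in cb.values())
    mi = sum(v / n * math.log((v / n) / ((ca[x] / n) * (cb[y] / n)))
             for (x, y), v in cab.items())
    if h_a + h_b == 0:
        # Both labelings put every node in one cluster; treat as perfect agreement.
        return 1.0
    return 2.0 * mi / (h_a + h_b)


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

The measure is invariant to relabeling, so the first pair (same partition, swapped names) attains the maximum, while the second pair of statistically independent partitions attains the minimum.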

Figure 1 shows the heatmaps of average NMI for all methods under these settings, as a function of $r$ and
$\mu$. As one would expect, the performance of spectral clustering (c), which uses only the network
information, is only affected by $r$ (the larger $r$ is, the harder the problem), and the performance of
$K$-means (d), which uses only the features, is only affected by $\mu$ (the larger $\mu$ is, the easier the
problem). JCDC is able to take advantage of both network and feature information by estimating the
coefficients $\beta$ from data, and its performance only deteriorates when neither is informative. The
informative features are more helpful with a larger value of $w_n$ (a), and conversely uninformative
features affect performance slightly more with a lower value of $w_n$ (b), but this effect is not strong.
CASC (e) appears to inherit the sharp phase transition from spectral clustering, which forms the basis of
CASC; the sharp transition is perhaps due to different community sizes and hub nodes, which are both
challenging to spectral clustering. CESNA (f) and BAGC (g) do not perform as well overall, with BAGC often
clustering all the hub nodes into one community.
5 Data applications

5.1 The world trade network 

The world trade network connects 80 countries based on the amount of trade of metal manufactures 
between them in 1994, or when not available for that year, in 1993 or 1995. Nodes are countries and 
edges represent positive amount of import and/or export between the countries. Each country also has 
three categorical features: the continent (Africa, Asia, Europe, N. America, S. America, and Oceania), the 
country’s structural position in the world system in 1980 (core, strong semi-periphery, weak semi-periphery, 
periphery) and in 1994 (core, semi-periphery, periphery). Figures 2(a) to (c) show the adjacency matrix
rearranged by sorting the nodes by each of the features. The partition by continent (Figure 2(a)) clearly
shows community structure, whereas the other two features show hubs (core status countries trade with 
everyone), and no assortative community structure. We will thus compare partitions found by all the 
competing methods to the continents, and omit the three Oceania countries from further analysis because 
no method is likely to detect such a small community. The two world position variables (’80 and ’94) will 
be used as features, treated as ordinal variables. 



[Figure 2 panels: (a) A by continent; (b) A by position ’80; (c) A by position ’94; (d) Continent;
(e) JCDC, $w_n = 5$, NMI=0.54; (f) JCDC, $w_n = 1.5$, NMI=0.50; (g) SC, NMI=0.47; (h) KM, NMI=0.25;
(i) CASC, NMI=0.39; (j) CESNA, NMI=0.26; (k) BAGC, NMI=0.11]

Figure 2: (a)-(c): the adjacency matrix ordered by different node features; (d) network with nodes colored
by continent (taken as ground truth); blue is Africa, red is Asia, green is Europe, cyan is N. America and
purple is S. America; (e)-(k) community detection results from different methods; colors are matched to (d)
in the best way possible.

The results for all methods are shown in Figure 2, along with NMI values comparing the detected partition
to the continents. All methods were run with the true value K = 5.

The result of spectral clustering agrees much better with the continents than that of $K$-means, indicating
that the community structure in the adjacency matrix is closer to the continents than the structure
contained in the node features. JCDC obtains the highest NMI value, CASC performs similarly to spectral
clustering, whereas CESNA and BAGC both fail to recover the continent partition. Note that no method was
able to estimate Africa well, likely due to the disassortative nature of its trade seen in Figure 2(a).
Figure 2(e) indicates that JCDC estimated N. America, S. America and Asia with high accuracy, but split
Europe into two communities, since it was run with K = 5 and could not pick up Africa due to its
disassortative structure.
Table 1 contains the estimated feature coefficients, suggesting that in 1980 the “world position” had the
most influence on the connections formed by Asian countries, whereas in 1994 world position mattered most
in Europe.

Table 1: Feature coefficients estimated by JCDC with $w_n = 5$. Best match is determined by majority vote.

Community   Best match   Position ’80   Position ’94

blue        Europe       0.000          0.143

red         Asia         0.314          0.127

green       Europe       0.017          0.204

cyan        N. America   0.107          0.000

purple      S. America   0.121          0.000

5.2 The lawyer friendship network 

The second dataset we consider is a friendship network of 71 lawyers in a New England corporate law
firm. Seven node features are available: status (partner or associate), gender, office location (Boston,
Hartford, or Providence, a very small office with only two non-isolated nodes), years with the firm, age,
practice (litigation or corporate) and law school attended (Harvard, Yale, University of Connecticut, or
other). Categorical features with $M$ levels are represented by $M - 1$ dummy indicator variables. Figures
3(a)-(g) show heatmap plots of the adjacency matrix with nodes sorted by each feature, after eliminating
six isolated nodes. Partition by status (Figure 3(a)) shows a strong assortative structure, and so does
partition by office (Figure 3(c)) restricted to Boston and Hartford, but the small Providence office does
not have any kind of structure. Thus we chose the status partition as a reference point for comparisons,
though other partitions are certainly also meaningful.

Communities estimated by different methods are shown in Figure 3(i)-(o), all run with K = 2. Spectral
clustering and $K$-means have equal and reasonably high NMI values, indicating that both the adjacency
matrix and node features contain community information. JCDC obtains the highest NMI value, with
$w_n = 5$ performing slightly better than $w_n = 1.5$. CASC improves upon spectral clustering by using the
feature information, with NMI just slightly lower than that of JCDC with $w_n = 1.5$. CESNA and BAGC have
much lower NMI values, possibly because of hub nodes, or because they detect communities corresponding
to something other than status.

The estimated feature coefficients are shown in Table 2. Office location, years with the firm, and age
appear to be the features most correlated with the community structure of status, for both partners and 
associates, which is natural. Practice, school, and gender are less important, though it may be hard to 
estimate the influence of gender accurately since there are relatively few women in the sample. 

Table 2: Feature coefficients $\beta_k$, JCDC with $w_n = 5$.

Comm.       gender   office   years   age     practice   school

partner     0.290    0.532    0.212   0.390   0.095      0.000

associate   0.012    0.378    0.725   0.320   0.118      0.097


6 Discussion 

Our method incorporates feature-based weights into a community detection criterion, improving detection 
compared to using just the adjacency matrix or the node features alone, if the cluster structure in the features 
is related to the community structure in the adjacency matrix. It has the ability to estimate coefficients 
for each feature within each community and thus learn which features are correlated with the community 
structure. This ability guards against including noise features which can mislead community detection. The
community detection criterion we use is designed for assortative community structure, with more connections
within communities than between, and benefits the most from using features that have a similar clustering
structure.

[Figure 3 panels: (a) A by status; (b) A by gender; (c) A by office; (d) A by years; (e) A by age;
(f) A by practice; (g) A by school; (h) Status; (i) JCDC, $w_n = 5$, NMI=0.54; (j) JCDC, $w_n = 1.5$, NMI=0.50;
(k) SC, NMI=0.44; (l) KM, NMI=0.44; (m) CASC, NMI=0.49; (n) CESNA, NMI=0.07; (o) BAGC, NMI=0.20]

Figure 3: (a)-(g): adjacency matrix with nodes sorted by features; (h): network with nodes colored by status
(blue is partner, red is associate); (i)-(o): community detection results from different methods.

This work can be extended in several directions. Variation in node degrees, often modeled via the
degree-corrected stochastic block model which regards degrees as independent of community structure, may in
some cases be correlated with node features, and accounting for degree variation jointly with features can 
potentially further improve detection. Another useful extension is to overlapping communities. One possible 
way to do that is to optimize each summand in JCDC (1.2) separately and in parallel, which can create 
overlaps, but would require careful initialization. Statistical models that specify exactly how features are 
related to community assignments and edge probabilities can also be useful, though empirically we found no 
such standard models that could compete with the non-model-based JCDC on real data. This suggests that 
more involved and perhaps data-specific modeling will be necessary to accurately describe real networks, 
and some of the techniques we proposed, such as community-specific feature coefficients, could be useful in 
that context. 
Appendix

A.1 Choice of tuning parameters

The JCDC method involves two user-specified tuning parameters, $\alpha$ and $w_n$. In this section, we
investigate the impact of these tuning parameters on community detection results via numerical experiments.

First we study the impact of $\alpha$, which determines the algorithm’s preference for larger or smaller
communities. We study its effect on the estimated community size as well as on the accuracy of estimated
community labels. We generate data from a stochastic block model with $n = 120$ nodes and $K = 2$
communities of sizes $n_1$ and $n_2 = n - n_1$. We set the within-community edge probabilities to 0.3 and
between-community edge probabilities to 0.15, and vary $n_1$ from 60 to 110. Since $\alpha$ is not related to
feature weights, we set features to a constant, resulting in unweighted networks. The results are averaged
over 50 replications and shown in Figure 4.



[Figure 4 panels: (a) The size of the larger estimated community; (b) Community detection accuracy]

Figure 4: (a) The size of the larger estimated community as a function of the tuning parameter $\alpha$. (b)
Estimation accuracy measured by NMI as a function of the tuning parameter $\alpha$. Solid lines correspond to
JCDC and horizontal dotted lines correspond to spectral clustering, which does not depend on $\alpha$.


We report the size of the larger estimated community in Figure 4(a), and the accuracy of community
detection as measured by normalized mutual information (NMI) in Figure 4(b). For comparison, we also
record the results from spectral clustering (horizontal lines in Figure 4), which do not depend on $\alpha$.
When communities are balanced ($n_1 = n_2 = 60$), JCDC performs well for all values of $\alpha$, producing
balanced communities and uniformly outperforming spectral clustering in terms of NMI. In general, larger
values of $\alpha$ in JCDC result in more balanced communities, while smaller values of $\alpha$ tend to produce
a large and a small community. In terms of community detection accuracy, Figure 4(b) shows that the JCDC
method outperforms spectral clustering over a range of values of $\alpha$, and this range depends on how
unbalanced the communities are. For simplicity and ease of interpretation, we set $\alpha = 1$ for all the
simulations and data analysis reported in the main manuscript; however, it can be changed by the user if
information about community sizes is available.

Next, we investigate the impact of $w_n$, which controls the influence of features. To study the trade-off
between the two sources of information (network and features), we generate two different community
partitions. Specifically, we consider two communities of sizes $n_1$ and $n_2$, with $n_1 + n_2 = n = 120$. We
generate two label vectors $c^A$ and $c^F$, with $c^A_i = 1$ for $i = 1, \dots, n_1$ and $c^A_i = 2$ for
$i = n_1 + 1, \dots, n$, while the other label vector has $c^F_i = 1$ for $i = 1, \dots, n_2$ and $c^F_i = 2$ for
$i = n_2 + 1, \dots, n$. Then the edges are generated from the stochastic block model based on $c^A$, and the
node features are generated based on $c^F$. We generate two node features: one feature is sampled from the
distribution $N(\mu, 1)$ if $c^F_i = 1$ and $N(0, 1)$ if $c^F_i = 2$; the other feature is sampled from $N(0, 1)$ if
$c^F_i = 1$ and $N(-\mu, 1)$ if $c^F_i = 2$. We fix $\mu = 3$ and set $\alpha = 1$, as discussed above. We set the
within- and between-community edge probabilities to 0.3 and 0.15, respectively, same as in the previous
simulation, and vary the value of $w_n$ from 1.1 to 10. Finally, we look at the agreement between the
estimated communities $\hat{e}$ and $c^A$ and $c^F$, as measured by normalized mutual information. The results
are shown in Figure 5.
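The two conflicting partitions and the feature-generating step of this experiment can be sketched as below. Function names, the use of Python's `random.Random.gauss`, and the seed are assumptions for this example; labels are coded 1 and 2 as in the text.

```python
import random


def make_conflicting_labels(n1, n=120):
    """Network labels split at n1; feature labels split at n2 = n - n1."""
    n2 = n - n1
    c_A = [1 if i < n1 else 2 for i in range(n)]
    c_F = [1 if i < n2 else 2 for i in range(n)]
    return c_A, c_F


def make_features(c_F, mu=3.0, seed=0):
    """Two features per node, with means determined by the feature labels c_F."""
    rng = random.Random(seed)
    feats = []
    for c in c_F:
        f1 = rng.gauss(mu if c == 1 else 0.0, 1.0)   # N(mu, 1) vs N(0, 1)
        f2 = rng.gauss(0.0 if c == 1 else -mu, 1.0)  # N(0, 1) vs N(-mu, 1)
        feats.append((f1, f2))
    return feats


c_A, c_F = make_conflicting_labels(n1=90)
features = make_features(c_F)
print(sum(a != f for a, f in zip(c_A, c_F)), "nodes have conflicting labels")
```

With $n_1 = 90$ the two partitions disagree on 60 of the 120 nodes, whereas with $n_1 = 60$ they coincide, which is the balanced case where network and features carry the same signal.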



[Figure 5: horizontal axis $w_n$]

Figure 5: NMI between the estimated community structure $\hat{e}$ and the network community structure $c^A$
(solid lines) and the feature community structure $c^F$ (dotted lines). Note that when $n_1 = n_2 = 60$,
$c^A = c^F$, so the solid and dotted lines coincide.

As we expect, smaller values of $w_n$ give more influence to features and thus the estimated community
structure agrees better with $c^F$ than with $c^A$. As $w_n$ increases, the estimated $\hat{e}$ becomes closer to
$c^A$. In the manuscript, we compare two values of $w_n$, 1.5 and 5.

A.2 Proofs 

We start with summarizing notation. Let be the estimated communities corresponding to the 

label vector e, and C i,... ,Cp the true communities corresponding to the label vector c. Recall we estimate 
e by maximizing the criterion R over e and /3, where 

K \ 

k =1 |tfc| i,j££ k 


and define 


e = argmax ( maxR(e,/3;m„) 
e \ p 
where $\hat{e}$ and the corresponding $\hat{\beta}$ are defined up to a permutation of community labels. Recall that we
assumed $A$ and $F$ are conditionally independent given $c$ and defined $R^0$, the “population version” of $R$, as

$$R^0(e,\beta;w_n) = \sum_{k=1}^{K} \frac{1}{|\mathcal{E}_k|^{\alpha}} \sum_{i,j\in\mathcal{E}_k} \rho_n P_{c_i c_j}\, E\big[W(\phi_{ij},\beta_k;w_n)\big].$$

The expectation in $R^0$ is taken with respect to the distribution of node features, which determine the
similarities $\phi_{ij}$.
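
To make the criterion concrete, the sketch below evaluates $R(e,\beta;w_n)$ for given labels and weights. It assumes $W(\phi_{ij},\beta_k;w_n) = w_n - \exp(-\langle\phi_{ij},\beta_k\rangle)$, which is the form the bound in the proof of Lemma 3 relies on. The array layout (`phi` of shape `(n, n, p)`), the 0-based labels, and the function name `criterion_R` are assumptions made for illustration only.

```python
import numpy as np

def criterion_R(A, phi, e, beta, w_n, alpha):
    """R(e, beta; w_n) = sum_k |E_k|^(-alpha) sum_{i,j in E_k} A_ij W(phi_ij, beta_k; w_n)."""
    total = 0.0
    for k in range(beta.shape[0]):
        idx = np.flatnonzero(e == k)          # nodes assigned to community k
        if idx.size == 0:
            continue
        sub_A = A[np.ix_(idx, idx)]
        sub_phi = phi[np.ix_(idx, idx)]       # shape (m, m, p)
        W = w_n - np.exp(-sub_phi @ beta[k])  # assumed form of W
        total += (sub_A * W).sum() / idx.size ** alpha
    return total

# Tiny example: complete graph on 4 nodes, zero similarities, two communities.
A = np.ones((4, 4)) - np.eye(4)
phi = np.zeros((4, 4, 2))
e = np.array([0, 0, 1, 1])
beta = np.zeros((2, 2))
value = criterion_R(A, phi, e, beta, w_n=2.0, alpha=1.0)
```

Estimating $\hat{e}$ would additionally require a search over label vectors and an inner maximization over $\beta$, which this sketch does not attempt.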

Lemma 2. Under conditions 1 and 2, if $w_n\rho_n \to \infty$ and $0 < \alpha < 2$, we have

$$\max_{e,\beta} \frac{|R(e,\beta;w_n) - R^0(e,\beta;w_n)|}{w_n\rho_n n^{2-\alpha}} = O_p\!\left(\frac{1}{\sqrt{w_n\rho_n}}\right).$$


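Proof of Lemma 2. We first bound the difference between $R$ and $R^0$ for fixed $e$ and $\beta$. By Hoeffding’s
inequality and the fact that $2\lfloor n/2\rfloor \ge n - 1$, where $\lfloor x\rfloor$ is the integer part of $x$, we have

$$P\left(\frac{1}{|\mathcal{E}_k|^{2}}\left|\sum_{i,j\in\mathcal{E}_k}\Big(A_{ij}W(\phi_{ij},\beta_k;w_n) - \rho_n P_{c_ic_j}E[W(\phi_{ij},\beta_k;w_n)]\Big)\right| > t\right) \le 2\exp\big(-(|\mathcal{E}_k|-1)t^2\big).$$

Taking $t = w_n\rho_n n^{2-\alpha}|\mathcal{E}_k|^{\alpha-2}\delta$ and applying the union bound, we have

$$P\left(\frac{|R(e,\beta;w_n) - R^0(e,\beta;w_n)|}{w_n\rho_n n^{2-\alpha}} > K\delta\right)
\le \sum_{k=1}^{K} P\left(\frac{\left|\sum_{i,j\in\mathcal{E}_k}\Big(A_{ij}W(\phi_{ij},\beta_k;w_n) - \rho_n P_{c_ic_j}E[W(\phi_{ij},\beta_k;w_n)]\Big)\right|}{w_n\rho_n|\mathcal{E}_k|^{\alpha}n^{2-\alpha}} > \delta\right)$$

$$\le \sum_{k=1}^{K} 2\exp\big\{-(|\mathcal{E}_k|-1)\,w_n^2\rho_n^2\,n^{4-2\alpha}|\mathcal{E}_k|^{2\alpha-4}\delta^2\big\} \le 2K\exp\big\{-(\pi_0 n-1)\,w_n^2\rho_n^2\delta^2\big\}.$$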

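Next, we take the uniform bound over $\beta$. Consider the set

$$B_\varepsilon = \left\{\left(\frac{s_1\varepsilon}{\sqrt{p}},\dots,\frac{s_p\varepsilon}{\sqrt{p}}\right) : s_1,\dots,s_p \in \left\{0,\pm 1,\dots,\pm\left(\left\lfloor\frac{M_\beta\sqrt{p}}{\varepsilon}\right\rfloor + 1\right)\right\}\right\}.$$

It is straightforward to verify that $B_\varepsilon$ is an $\varepsilon$-net on $[-M_\beta, M_\beta]^p$, the space of the $\beta_k$'s. For each $\beta_k$, let $\beta(\beta_k, B_\varepsilon)$
be the best approximation to $\beta_k$ in $B_\varepsilon$. Then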
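
The covering property claimed for the net can be checked numerically. The sketch below builds one standard grid with spacing $\varepsilon/\sqrt{p}$ on $[-M_\beta, M_\beta]^p$ and confirms that every point of the cube has a grid point within Euclidean distance $\varepsilon$. The exact grid used here (index range $\lfloor M_\beta\sqrt{p}/\varepsilon\rfloor + 1$) and the function names are assumptions made for illustration.

```python
import itertools
import math
import numpy as np

def epsilon_net(M_beta, p, eps):
    """All grid points (s_1, ..., s_p) * eps / sqrt(p) with |s_i| <= floor(M_beta*sqrt(p)/eps) + 1."""
    step = eps / math.sqrt(p)
    s_max = math.floor(M_beta * math.sqrt(p) / eps) + 1
    grid = np.arange(-s_max, s_max + 1) * step
    return np.array(list(itertools.product(grid, repeat=p)))

def nearest_in_net(beta, eps):
    """Best grid approximation: round each coordinate to the nearest multiple of eps / sqrt(p)."""
    step = eps / math.sqrt(beta.size)
    return np.round(beta / step) * step

M_beta, p, eps = 1.0, 2, 0.5
net = epsilon_net(M_beta, p, eps)
rng = np.random.default_rng(0)
betas = rng.uniform(-M_beta, M_beta, size=(1000, p))
# Each coordinate is off by at most step / 2, so the Euclidean error is at most eps / 2.
worst = max(np.linalg.norm(b - nearest_in_net(b, eps)) for b in betas)
```

The net has $(2 s_{\max} + 1)^p$ points, so its size grows polynomially in $1/\varepsilon$ for fixed $p$, which is what makes the later union bound over the net affordable.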


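$$\max\big|W(\phi_{ij},\beta_k;w_n) - W(\phi_{ij},\beta(\beta_k,B_\varepsilon);w_n)\big| \le \max\left|\frac{\partial W}{\partial\beta_k}(\phi_{ij},\beta_k;w_n)\right|\,\big|\beta_k - \beta(\beta_k,B_\varepsilon)\big|$$

$$\le 2M_\phi M_\beta\exp(M_\phi M_\beta)\,\varepsilon \le 2M_\phi M_\beta w_n\varepsilon.$$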
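Therefore, choosing $\varepsilon$ = , we have

$$P\left(\max_{\beta}\frac{|R(e,\beta;w_n) - R^0(e,\beta;w_n)|}{w_n\rho_n n^{2-\alpha}} > K\delta\right)
\le \sum_{k=1}^{K} P\left(\max_{\beta_k}\frac{\left|\sum_{i,j\in\mathcal{E}_k}(A_{ij}-\rho_nP_{c_ic_j})W(\phi_{ij},\beta_k;w_n)\right|}{w_n\rho_n|\mathcal{E}_k|^{\alpha}n^{2-\alpha}} > \delta\right)$$

$$\le \sum_{k=1}^{K} P\left(\max_{\beta_k}\frac{\left|\sum_{i,j\in\mathcal{E}_k}(A_{ij}-\rho_nP_{c_ic_j})\big(W(\phi_{ij},\beta_k;w_n) - W(\phi_{ij},\beta(\beta_k,B_\varepsilon);w_n)\big)\right|}{w_n\rho_n|\mathcal{E}_k|^{\alpha}n^{2-\alpha}} > \frac{\delta}{2}\right)$$

$$\quad + \sum_{k=1}^{K} P\left(\max_{\beta_k\in B_\varepsilon}\frac{\left|\sum_{i,j\in\mathcal{E}_k}(A_{ij}-\rho_nP_{c_ic_j})W(\phi_{ij},\beta_k;w_n)\right|}{w_n\rho_n|\mathcal{E}_k|^{\alpha}n^{2-\alpha}} > \frac{\delta}{2}\right)$$

$$\le K\,P\left(\frac{|\mathcal{E}_k|^{2-\alpha}\cdot 2M_\phi M_\beta\varepsilon}{\rho_n n^{2-\alpha}} > \frac{\delta}{2}\right) + 2K|B_\varepsilon|\exp\big\{-(\pi_0n-1)w_n^2\rho_n^2\delta^2/4\big\}$$

$$\le 0 + 2K\left(\frac{4M_\phi M_\beta\sqrt{p}}{\rho_n\delta} + 3\right)^{p}\exp\big\{-(\pi_0n-1)w_n^2\rho_n^2\delta^2/4\big\},$$

where the first term becomes 0 because of the choice of $\varepsilon$ and $|\mathcal{E}_k| \le n$. Finally, taking a union bound over
all possible community assignments, we have

$$P\left(\max_{e,\beta}\frac{|R(e,\beta;w_n) - R^0(e,\beta;w_n)|}{w_n\rho_n n^{2-\alpha}} > K\delta\right) \le 2K^{n+1}\left(\frac{4M_\phi M_\beta\sqrt{p}}{\rho_n\delta} + 3\right)^{p}\exp\big\{-(\pi_0n-1)w_n^2\rho_n^2\delta^2/4\big\}$$

$$\le 2K\exp\big[-\pi_0 n w_n^2\rho_n^2\delta^2/8 + n\log K + p\log\{C_1/(\rho_n\delta)\}\big],$$

where $C_1 := 4M_\phi M_\beta\sqrt{p}$. Taking $\delta = 1/\sqrt{w_n\rho_n}$ completes the proof of Lemma 2. □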

We now proceed to investigate the “population version” of our criterion, $R^0$. Define $U \in \mathbb{R}^{K\times K}$ by
$U_{kl} = \sum_{i=1}^{n} 1[e_i = k, c_i = l]/n$ and let $D$ be a diagonal $K \times K$ matrix with $\pi_1,\dots,\pi_K$ on the diagonal,
where $\pi_k = \sum_{i=1}^{n} 1[c_i = k]/n$ is the fraction of nodes in community $\mathcal{C}_k$. Roughly speaking, $U$ is the confusion
matrix between $e$ and $c$, and $U = DO$ for a permutation matrix $O$ means the estimation is perfect. Define

$$g(U) = \sum_{k=1}^{K} \frac{\sum_{l=1}^{K}\sum_{l'=1}^{K} U_{kl}U_{kl'}P_{ll'}}{\big(\sum_{l=1}^{K}U_{kl}\big)^{\alpha}}.$$

Each estimated community assignment $e$ induces a unique $U = U(e)$. It is not difficult to verify that

Lemma 3. Under conditions 1 and 2, there exists a constant $C_2$ such that

$$\max_{e,\beta}\left|\frac{R^0(e,\beta;w_n)}{w_n\rho_n n^{2-\alpha}} - g(U(e))\right| \le \frac{C_2}{w_n}.$$

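Proof of Lemma 3. By definition, we have

$$\max_{e,\beta}\left|\frac{R^0(e,\beta;w_n)}{w_n\rho_n n^{2-\alpha}} - g(U(e))\right| = \max_{e,\beta}\left|\sum_{k=1}^{K}\sum_{i,j\in\mathcal{E}_k}\frac{P_{c_ic_j}E[\exp(-\langle\phi_{ij},\beta_k\rangle)]}{|\mathcal{E}_k|^{\alpha}w_n n^{2-\alpha}}\right|$$

$$\le \max_{e}\sum_{k=1}^{K}\sum_{i,j\in\mathcal{E}_k}\frac{\exp(M_\phi M_\beta)}{|\mathcal{E}_k|^{\alpha}w_n n^{2-\alpha}}\max_{kl}P_{kl} \le \frac{K\exp(M_\phi M_\beta)}{w_n\pi_0^{2}}\max_{kl}P_{kl} = \frac{C_2}{w_n},$$

where $C_2 := K\pi_0^{-2}\exp(M_\phi M_\beta)\max_{kl}P_{kl}$, and the two inequalities follow from conditions 1 and 2 respectively. □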
Lemma 4. Under condition 0, if $\alpha \in [\max_{1\le k<l\le K} 2(K-1)P_{kl}/\min(P_{kk},P_{ll}),\, 1]$, then for all $U$ satisfying
$\sum_{k=1}^{K} U_{kl} = \pi_l$ for $1 \le l \le K$, $g(U)$ is uniquely maximized at $U = DO$ for $O \in \mathcal{O}_K$, where $\mathcal{O}_K$ denotes the
set of $K \times K$ permutation matrices.
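
The objects in Lemma 4 are easy to compute, which allows a quick numerical sanity check of the claim. The sketch below builds the confusion matrix $U(e)$ and evaluates $g$, then compares random label vectors against the true one. The function names, the 0-based labels, and the particular $P$ and $\alpha$ (chosen so that $2(K-1)P_{kl}/\min(P_{kk},P_{ll}) \le \alpha \le 1$) are assumptions made for illustration; a random search is of course not a proof.

```python
import numpy as np

def confusion_U(e, c, K):
    """U_kl = (1/n) * #{i : e_i = k, c_i = l}."""
    U = np.zeros((K, K))
    np.add.at(U, (np.asarray(e), np.asarray(c)), 1.0 / len(e))
    return U

def g(U, P, alpha):
    """g(U) = sum_k (sum_{l,l'} U_kl U_kl' P_ll') / (sum_l U_kl)^alpha, skipping empty rows."""
    row = U.sum(axis=1)
    quad = np.einsum("kl,km,lm->k", U, U, P)
    keep = row > 0
    return float(np.sum(quad[keep] / row[keep] ** alpha))

K, alpha = 3, 1.0
P = np.full((K, K), 0.1) + 0.4 * np.eye(K)   # 2*(K-1)*0.1/0.5 = 0.8 <= alpha
c = np.repeat(np.arange(K), [10, 8, 6])
g_true = g(confusion_U(c, c, K), P, alpha)    # U(c) = D

rng = np.random.default_rng(0)
g_random = [g(confusion_U(rng.integers(0, K, c.size), c, K), P, alpha)
            for _ in range(500)]
```

In this example every entry of `g_random` stays at or below `g_true`, in line with the lemma.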


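Proof of Lemma 4. We have

$$g(D) - g(U) = \sum_{l=1}^{K}\Big(\sum_{k=1}^{K}U_{kl}\Big)^{2-\alpha}P_{ll} - \sum_{k=1}^{K}\frac{\sum_{l=1}^{K}U_{kl}^{2}P_{ll} + \sum_{l=1}^{K}\sum_{l'\ne l}U_{kl}U_{kl'}P_{ll'}}{\big(\sum_{a=1}^{K}U_{ka}\big)^{\alpha}}$$

$$= \sum_{l=1}^{K}\left\{\Big(\sum_{k=1}^{K}U_{kl}\Big)^{2-\alpha} - \sum_{k=1}^{K}\frac{U_{kl}^{2}}{\big(\sum_{a=1}^{K}U_{ka}\big)^{\alpha}}\right\}P_{ll} - \sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{l'\ne l}\frac{U_{kl}U_{kl'}}{\big(\sum_{a=1}^{K}U_{ka}\big)^{\alpha}}P_{ll'}. \tag{6.1}$$

Since $2-\alpha \ge 1$, we have $\big(\sum_{k=1}^{K}U_{kl}\big)^{2-\alpha} \ge \sum_{k=1}^{K}U_{kl}^{2-\alpha}$. By the mean value theorem,
there exists $\xi_{kl} \in \big(0, \sum_{a\ne l}U_{ka}\big)$, such that

$$\Big(\sum_{a=1}^{K}U_{ka}\Big)^{\alpha} - U_{kl}^{\alpha} = \frac{\alpha\sum_{a\ne l}U_{ka}}{(U_{kl}+\xi_{kl})^{1-\alpha}} \ge \frac{\alpha\sum_{a\ne l}U_{ka}}{\big(\sum_{a=1}^{K}U_{ka}\big)^{1-\alpha}}. \tag{6.2}$$

Finally, we will need the following inequality: for $0 < \alpha < 2$ and $x, y \ge 0$ satisfying $x + y \le u$,

$$x^{2-\alpha}(u-x) + y^{2-\alpha}(u-y) \ge xyu^{1-\alpha}. \tag{6.3}$$

For $x = y = 0$, equality holds. To verify (6.3) when $0 < x + y \le u$, dividing by $u^{3-\alpha}$ we have

$$\frac{x^{2-\alpha}(u-x) + y^{2-\alpha}(u-y) - xyu^{1-\alpha}}{u^{3-\alpha}} = \Big(\frac{x}{u}\Big)^{2-\alpha}\Big(1-\frac{x}{u}\Big) + \Big(\frac{y}{u}\Big)^{2-\alpha}\Big(1-\frac{y}{u}\Big) - \frac{xy}{u^{2}}$$

$$\ge \Big(\frac{x}{u}\Big)^{2}\Big(1-\frac{x}{u}\Big) + \Big(\frac{y}{u}\Big)^{2}\Big(1-\frac{y}{u}\Big) - \frac{xy}{u^{2}} = \left\{\Big(\frac{x}{u}\Big)^{2} - \frac{xy}{u^{2}} + \Big(\frac{y}{u}\Big)^{2}\right\}\Big(1-\frac{x}{u}-\frac{y}{u}\Big) \ge 0.$$

The first inequality above implies that a necessary condition for equality to hold in (6.3) is $xy = 0$.
We now lower bound the first term on the right hand side of (6.1).

$$\sum_{l=1}^{K}\left\{\Big(\sum_{k=1}^{K}U_{kl}\Big)^{2-\alpha} - \sum_{k=1}^{K}\frac{U_{kl}^{2}}{\big(\sum_{a=1}^{K}U_{ka}\big)^{\alpha}}\right\}P_{ll}
\ge \sum_{l=1}^{K}\sum_{k=1}^{K}\left\{U_{kl}^{2-\alpha} - \frac{U_{kl}^{2}}{\big(\sum_{a=1}^{K}U_{ka}\big)^{\alpha}}\right\}P_{ll}$$

$$= \sum_{l=1}^{K}\sum_{k=1}^{K}U_{kl}^{2-\alpha}\,\frac{\big(\sum_{a=1}^{K}U_{ka}\big)^{\alpha} - U_{kl}^{\alpha}}{\big(\sum_{a=1}^{K}U_{ka}\big)^{\alpha}}\,P_{ll}
\ge \sum_{l=1}^{K}\sum_{k=1}^{K}\frac{U_{kl}^{2-\alpha}\big(\sum_{a\ne l}U_{ka}\big)}{\sum_{a=1}^{K}U_{ka}}\,\alpha P_{ll}$$

$$\ge \sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{l'\ne l}\frac{U_{kl}^{2-\alpha}\big(\sum_{a\ne l}U_{ka}\big)}{\sum_{a=1}^{K}U_{ka}}\,2P_{ll'}
= \sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{l'\ne l}\frac{U_{kl}^{2-\alpha}\big(\sum_{a\ne l}U_{ka}\big) + U_{kl'}^{2-\alpha}\big(\sum_{a\ne l'}U_{ka}\big)}{\sum_{a=1}^{K}U_{ka}}\,P_{ll'}$$

$$\ge \sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{l'\ne l}\frac{U_{kl}U_{kl'}}{\big(\sum_{a=1}^{K}U_{ka}\big)^{\alpha}}\,P_{ll'}, \tag{6.4}$$

where the second inequality uses (6.2), the third uses the assumed lower bound on $\alpha$, which gives $\alpha P_{ll} \ge 2\sum_{l'\ne l}P_{ll'}$,
and the last inequality is obtained by applying (6.3) with $x = U_{kl}$, $y = U_{kl'}$ and $u = \sum_{a=1}^{K}U_{ka}$. Plugging
(6.4) into (6.1), we have

$$g(D) - g(U) \ge 0.$$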
It remains to show that equality holds only if $U = DO$ for some $O \in \mathcal{O}_K$. Note that the last inequality
in (6.4) is obtained from (6.3), where equality holds only when $xy = 0$. The corresponding condition for
equality to hold in (6.4) is thus $U_{kl}U_{kl'} = 0$ for all $k$, $l$ and $l' \ne l$. Therefore, for each $k$, there is only one $l$ such
that $U_{kl} \ne 0$, i.e., $U = DO$ for some $O \in \mathcal{O}_K$. □
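
Inequality (6.3) is the elementary step on which the proof of Lemma 4 turns, so a numerical spot check may be reassuring. The sketch below samples $\alpha \in (0,2)$, $u > 0$ and $x, y \ge 0$ with $x + y \le u$, and evaluates the gap $x^{2-\alpha}(u-x) + y^{2-\alpha}(u-y) - xyu^{1-\alpha}$. The sampling ranges and the function name `gap_63` are arbitrary choices made here.

```python
import numpy as np

def gap_63(x, y, u, alpha):
    """Left side minus right side of (6.3); should be nonnegative when x + y <= u."""
    return x ** (2 - alpha) * (u - x) + y ** (2 - alpha) * (u - y) - x * y * u ** (1 - alpha)

rng = np.random.default_rng(0)
m = 100_000
alpha = rng.uniform(0.01, 1.99, m)
u = rng.uniform(0.1, 10.0, m)
# Sample (x, y) in the triangle x, y >= 0, x + y <= u.
s = rng.uniform(0.0, 1.0, m)     # fraction of u taken by x + y
t = rng.uniform(0.0, 1.0, m)     # split of that fraction between x and y
x = u * s * t
y = u * s * (1.0 - t)
gaps = gap_63(x, y, u, alpha)
smallest = gaps.min()

# A boundary case with xy = 0, where the two sides coincide: x = u, y = 0.
boundary = gap_63(2.0, 0.0, 2.0, 0.7)
```

The boundary case illustrates the equality condition used at the end of the proof: the gap can vanish only when $xy = 0$.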

Proof of Theorem 1. By Lemma 2 and Lemma 3 we have

$$\max_{e,\beta}\left|\frac{R(e,\beta;w_n)}{w_n\rho_n n^{2-\alpha}} - g(U(e))\right| = O_p\!\left(\frac{1}{\sqrt{w_n\rho_n}}\right). \tag{6.5}$$


It is straightforward to verify that, for any $e$, $2d(e,c) = \min_{O\in\mathcal{O}_K}\|U(e) - DO\|_1$, where $\|Q\|_1 = \sum_{k=1}^{K}\sum_{l=1}^{K}|Q_{kl}|$.
Take a sequence of decreasing positive numbers $x_n \to 0$ and define

$$y_n = \max_{U:\,g(D)-g(U)\le x_n}\ \min_{O\in\mathcal{O}_K}\|U - DO\|_1. \tag{6.6}$$


We now show, by contradiction, that $x_n \to 0$ implies $y_n \to 0$. First, note that $y_n$ is non-increasing. Now if
$y_0 = \lim_{n\to\infty} y_n > 0$, by compactness of the set $\mathcal{U}_{y_0} = \{U : \min_{O\in\mathcal{O}_K}\|U - DO\|_1 \ge y_0\}$ and continuity of
the function $g$, the supremum of $g(U)$ over $U \in \mathcal{U}_{y_0}$, which equals $g(D)$, is attained in $\mathcal{U}_{y_0}$. This contradicts
Lemma 4.

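Now let $x_n = 1/\sqrt[4]{w_n\rho_n}$. By assumption of Theorem 1, $x_n \to 0$, which yields $y_n \to 0$. Also $x_n/(1/\sqrt{w_n\rho_n}) =
\sqrt[4]{w_n\rho_n} \to \infty$, so by (6.5) we have

$$P\left(\left|\frac{R(\hat{e},\hat{\beta};w_n)}{w_n\rho_n n^{2-\alpha}} - g(U(\hat{e}))\right| > \frac{x_n}{2}\right) \to 0, \qquad P\left(\left|\frac{R(c,\hat{\beta};w_n)}{w_n\rho_n n^{2-\alpha}} - g(D)\right| > \frac{x_n}{2}\right) \to 0. \tag{6.7}$$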


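Now, the event

$$\left\{\left|\frac{R(\hat{e},\hat{\beta};w_n)}{w_n\rho_n n^{2-\alpha}} - g(U(\hat{e}))\right| \le \frac{x_n}{2}\ \text{ and }\ \left|\frac{R(c,\hat{\beta};w_n)}{w_n\rho_n n^{2-\alpha}} - g(D)\right| \le \frac{x_n}{2}\right\}$$

implies that $g(D) - g(U(\hat{e})) \le \frac{x_n}{2} + \frac{x_n}{2} = x_n$, because $(\hat{e},\hat{\beta})$ maximizes $R$ and hence
$R(\hat{e},\hat{\beta};w_n) \ge R(c,\hat{\beta};w_n)$. So we have

$$P\big(g(D) - g(U(\hat{e})) \le x_n\big) \to 1 \tag{6.8}$$

and, on this event,

$$2d(\hat{e},c) = \min_{O\in\mathcal{O}_K}\|U(\hat{e}) - DO\|_1 \le \max_{U:\,g(D)-g(U)\le x_n}\ \min_{O\in\mathcal{O}_K}\|U - DO\|_1 = y_n \to 0.$$

□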
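
The identity relating $d(e,c)$ to the $\ell_1$ distance between confusion matrices can be illustrated on a toy example. The sketch assumes that $d(e,c)$ is the fraction of misclassified nodes minimized over permutations of the community labels, and it realises the minimum over $\mathcal{O}_K$ by reordering the rows of $U$, that is, by relabelling the estimated communities. Function names and 0-based labels are hypothetical choices made here.

```python
import itertools
import numpy as np

def confusion_U(e, c, K):
    """U_kl = (1/n) * #{i : e_i = k, c_i = l}."""
    U = np.zeros((K, K))
    np.add.at(U, (np.asarray(e), np.asarray(c)), 1.0 / len(e))
    return U

def distance_via_U(e, c, K):
    """Half the smallest L1 distance between a row-permuted U(e) and D."""
    U = confusion_U(e, c, K)
    D = np.diag(U.sum(axis=0))            # column sums are the true proportions pi_l
    return min(np.abs(U[list(perm), :] - D).sum()
               for perm in itertools.permutations(range(K))) / 2.0

def distance_direct(e, c, K):
    """Fraction of misclassified nodes, minimized over relabellings of e."""
    e = np.asarray(e)
    c = np.asarray(c)
    return min(np.mean(np.asarray(perm)[e] != c)
               for perm in itertools.permutations(range(K)))

e = [0, 0, 1, 1, 1]
c = [1, 1, 0, 0, 1]
d_U = distance_via_U(e, c, K=2)
d_direct = distance_direct(e, c, K=2)
```

Both routes enumerate all $K!$ permutations, which is fine for small $K$ but would need an assignment solver for larger numbers of communities.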